Evaluating Structural Similarity in XML Documents

Authors

  • Andrew Nierman
  • H. V. Jagadish
Abstract

XML documents on the web are often found without DTDs, particularly when these documents have been created from legacy HTML. Yet having knowledge of the DTD can be valuable in querying and manipulating such documents. Recent work (cf. [10]) has given us a means to (re-)construct a DTD to describe the structure common to a given set of document instances. However, given a collection of documents with unknown DTDs, it may not be appropriate to construct a single DTD to describe every document in the collection. Instead, we would wish to partition the collection into smaller sets of “similar” documents, and then induce a separate DTD for each such set. It is this partitioning problem that we address in this paper. Given two XML documents, how can one measure structural (DTD) similarity between the two? We define a tree edit distance based measure suited to this task, taking into account XML issues such as optional and repeated sub-elements. We develop a dynamic programming algorithm to find this distance for any pair of documents. We validate our proposed distance measure experimentally. Given a collection of documents derived from multiple DTDs, we can compute pair-wise distances between documents in the collection, and then use these distances to cluster the documents. We find that the resulting clusters match the original DTDs almost perfectly, and demonstrate performance superior to alternatives based on previous proposals for measuring similarity of trees. The overall algorithm runs in time that is quadratic in document collection size, and quadratic in the combined size of the two documents involved in a given pair-wise distance calculation.
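
As a rough illustration of the kind of tree edit distance the abstract refers to, the following is a minimal sketch of a simplified, Selkow-style top-down distance computed by dynamic programming over the child lists of two elements. It is not the authors' measure: it does not model optional or repeated sub-elements, and the function names (`size`, `distance`) and the cost model (relabel costs 1, inserting or deleting a subtree costs its node count) are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node)


def distance(a, b):
    """Simplified top-down edit distance between two element trees.

    Assumed costs (illustrative, not from the paper): relabeling a node
    costs 1; inserting or deleting a whole subtree costs its node count.
    """
    relabel = 0 if a.tag == b.tag else 1
    xs, ys = list(a), list(b)
    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])      # delete subtree
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])      # insert subtree
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                       # delete
                dp[i][j - 1] + size(ys[j - 1]),                       # insert
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),    # match
            )
    return relabel + dp[-1][-1]


doc1 = ET.fromstring("<a><b/><c/></a>")
doc2 = ET.fromstring("<a><b/><c/><c/></a>")
print(distance(doc1, doc2))
```

Computing `distance` for every pair of documents in a collection yields a distance matrix that could then be passed to any standard clustering routine, which mirrors the pair-wise-then-cluster workflow described in the abstract.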

Similar Resources

Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity

Due to the increasing number of XML documents, organizing these documents effectively in order to retrieve useful information from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...

The Impact of Ontology on the Performance of Information Retrieval : A Case of

The large amount and heterogeneity of XML documents on the Web require the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and the links inside and among the documents. For instance, grouping together documents with similar structure has interesting applications in the context of informatio...

A Novel Approach to Measuring Structural Similarity between XML Documents

Measuring structural similarity between XML documents has become a key component in various applications, including XML mining, schema matching, and web service discovery, among others. This paper presents a novel structural similarity measure incorporating kernel methods into XML documents. Results on preliminary simulations show that this approach outperforms conventional ones.

Similarity Metric for XML Documents

Since XML documents can be represented as trees, this paper builds on traditional tree edit distance to present a structural similarity metric for XML documents, which is based on edge constraints, path constraints, and inclusive path constraints, as well as a similarity metric based on machine learning with node costs. It extends the scope for searching XML documents, and improves recall and precision for searchi...

A Structural Similarity Measure for XML Documents: Theory and Applications

XML (eXtensible Markup Language) has recently emerged as the most relevant standardization effort in the area of markup languages, and it is increasingly used as the language for information representation and exchange over the Web. An important feature of XML is that information on document structures is available on the Web together with the document contents. This information can be exploite...

Publication date: 2002